Multilingual Language Identification: ALTW 2010 Shared Task Data
Abstract
While there has traditionally been strong interest in the task of monolingual language identification, research on multilingual language identification is underrepresented in the literature, partly due to a lack of standardised datasets. This paper describes an artificially-generated dataset for multilingual language identification, as used in the 2010 Australasian Language Technology Workshop shared task.
Similar resources
Hierarchical classification for Multilingual Language Identification and Named Entity Recognition
This paper describes the approach for Subtask-1 of the FIRE2015 Shared Task on Mixed Script Information Retrieval. The subtask involved multilingual language identification (including mixed words and anomalous foreign words), named entity recognition (NER) and subclassification. The proposed methodology starts with cleaning the data and then extracting structural and contextual features from th...
Automatic Detection and Language Identification of Multilingual Documents
Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate the...
Word Level Language Identification in Online Multilingual Communication
Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We ach...
Data-Driven Dependency Parsing across Languages and Domains: Perspectives from the CoNLL-2007 Shared Task
The Conference on Computational Natural Language Learning features a shared task, in which participants train and test their learning systems on the same data sets. In 2007, as in 2006, the shared task has been devoted to dependency parsing, this year with both a multilingual track and a domain adaptation track. In this paper, I summarize the main findings from the 2007 shared task and try to i...
CoNLL-X Shared Task on Multilingual Dependency Parsing
Each year the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their systems on exactly the same data sets, in order to better compare systems. The tenth CoNLL (CoNLL-X) saw a shared task on Multilingual Dependency Parsing. In this paper, we describe how treebanks for 13 languages were converted into the same dependency ...
Publication date: 2010